
[Intel Mkl] Parallel BiasAddGrad op with eigen intra thread pool #26426

Merged

Conversation

@wenxizhu (Contributor) commented Mar 7, 2019

The BiasAddGrad op runs on a single thread, which badly hurts the training performance of the models that use it.

We provide an optimized parallel implementation of the BiasAddGrad op.

The following are the time costs of the original and the optimized BiasAddGrad op for several models, all measured on a single socket of an SKX-8180:
YoloV2 (batch size 16) : 263.76ms -> 12.06ms
Trans-LT (batch size 1024) : 68.61ms -> 3.09ms
Inception-ResV2 (batch size 64) : 334.96ms -> 73.18ms
NCF (batch size 1024) : 1.34ms -> 0.243ms
NCF-mlperf (batch size 1024) : 0.867ms -> 0.341ms
MaskRCNN (batch size 1) : 370.57ms -> 15.23ms
DCGAN (batch size 32) : 10.46ms -> 0.512ms
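
For context, here is a minimal sketch of the sharding idea named in the title: split the reduction across the Eigen intra-op thread pool via TensorFlow's work sharder. This sketch shards the channel dimension so each shard writes a disjoint slice of the output; the function name, parameters, and sharding strategy are illustrative assumptions, not the code in this PR's diff.

    // Hedged sketch: parallel NHWC bias gradient using the intra-op thread pool.
    #include "tensorflow/core/framework/op_kernel.h"
    #include "tensorflow/core/util/work_sharder.h"

    namespace tensorflow {

    // out_backprop has logical shape (outer_dim, channel_dim); bias_grad has
    // length channel_dim.
    template <typename T>
    void ParallelBiasGradNHWC(OpKernelContext* context, const T* out_backprop,
                              int64 outer_dim, int64 channel_dim, T* bias_grad) {
      auto* worker_threads = context->device()->tensorflow_cpu_worker_threads();
      // Rough cost per channel: one load and one add for every outer element.
      const int64 cost_per_channel = 2 * outer_dim;
      Shard(worker_threads->num_threads, worker_threads->workers, channel_dim,
            cost_per_channel, [&](int64 start, int64 limit) {
              // Each shard owns channels [start, limit), so writes never race.
              for (int64 c = start; c < limit; ++c) {
                T sum = T(0);
                for (int64 i = 0; i < outer_dim; ++i) {
                  sum += out_backprop[i * channel_dim + c];
                }
                bias_grad[c] = sum;
              }
            });
    }

    }  // namespace tensorflow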

Change-Id: I3a8da878eea67a4903a3b68302c6c86c3e536025
@rthadur rthadur requested a review from ezhulenev March 7, 2019 20:59
@rthadur rthadur self-assigned this Mar 7, 2019

@ezhulenev (Member) left a comment:

EDIT: I prepared this benchmark myself; it should be on GitHub soon.

Original:
Could you also add bias_op_test.cc with micro benchmarks? You can use https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/sparse_xent_op_test.cc as an example. I will be able to use them to compare improvements with the clang toolchain.
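
For illustration, here is a hedged sketch of what such a micro benchmark could look like, modeled on other kernel test files; the helper name and the chosen shape are assumptions, not the contents of the bias_op_test.cc that was eventually added.

    // Sketch of a BiasAddGrad micro benchmark in the style of other kernel tests.
    #include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/tensor.h"
    #include "tensorflow/core/graph/node_builder.h"
    #include "tensorflow/core/graph/testlib.h"
    #include "tensorflow/core/platform/test_benchmark.h"

    namespace tensorflow {

    // Builds a tiny graph with a single BiasAddGrad node over random NHWC data.
    static Graph* BiasAddGradGraph(int n, int h, int w, int c) {
      Graph* g = new Graph(OpRegistry::Global());
      Tensor out_backprop(DT_FLOAT, TensorShape({n, h, w, c}));
      out_backprop.flat<float>().setRandom();
      Node* node;
      TF_CHECK_OK(NodeBuilder(g->NewName("bias_add_grad"), "BiasAddGrad")
                      .Input(test::graph::Constant(g, out_backprop))
                      .Attr("data_format", "NHWC")
                      .Finalize(g, &node));
      return g;
    }

    static void BM_BiasAddGradNHWC_32_32_32_128_cpu(int iters) {
      testing::ItemsProcessed(static_cast<int64>(iters) * 32 * 32 * 32 * 128);
      test::Benchmark("cpu", BiasAddGradGraph(32, 32, 32, 128)).Run(iters);
    }
    BENCHMARK(BM_BiasAddGradNHWC_32_32_32_128_cpu);

    }  // namespace tensorflow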

#include "tensorflow/core/framework/bounds_check.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/util/tensor_format.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "tensorflow/core/util/work_sharder.h"

Member:
Could you please group all tensorflow includes together.

Contributor Author:
Done.

}
}

private:
TensorFormat data_format_;

// Modified for performance tune :: new funcs

Member:
I think you can strip all the "Modified for performance" comments; they are not really helpful for the reader.

Contributor Author:
Removed.

inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {
//#pragma simd
for (int64 i = 0; i < length; i++) {
X[i] = X[i] + Y[i];

Member:
Do you know if this pragma works with clang? This loop can be expressed as a simple Eigen expression, something like:

    using X = Eigen::Map<Eigen::Array<T, Eigen::Dynamic, 1>>;
    using Y = Eigen::Map<const Eigen::Array<T, Eigen::Dynamic, 1>>;
    X(x, length) += Y(y, length);

I probably messed up some constructor parameters, but a similar idea is used in https://bitbucket.org/eigen/eigen/src/9c300336de9a096c7c4c3b230a6ae9f49fa17aea/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h#lines-494:506

This will be vectorized/unrolled.
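
For reference, a self-contained plain-Eigen sketch of the Map idiom being suggested here (and of the companion .sum() reduction suggested a few comments below); the function and variable names are illustrative only.

    // Hedged illustration of the Eigen::Map idiom, outside TensorFlow.
    #include <cstdint>
    #include <Eigen/Core>

    // x[0:length-1] += y[0:length-1]; Eigen vectorizes/unrolls the expression.
    void InplaceVecAdd(float* x, const float* y, std::int64_t length) {
      Eigen::Map<Eigen::Array<float, Eigen::Dynamic, 1>> xm(x, length);
      Eigen::Map<const Eigen::Array<float, Eigen::Dynamic, 1>> ym(y, length);
      xm += ym;
    }

    // Returns sum(x[0:length-1]) as a single vectorized reduction.
    float VecSumReduce(const float* x, std::int64_t length) {
      return Eigen::Map<const Eigen::Array<float, Eigen::Dynamic, 1>>(x, length).sum();
    }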

Contributor Author:
OK, replaced them with Eigen APIs. I also figured there is no need to keep the InplaceVecAdd() and VecSumReduce() functions, so I inlined them into the lambda function.


// Modified for performance tune :: new funcs
// Apply X[0:length-1] = X[0:length-1] + Y[0:length-1];
inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {

Member:
Function arg names must be 'x' and 'y'.

Contributor Author:
This function has been removed, so that's not a problem now.

}
}
// Return sum(X[0:length-1])
inline T VecSumReduce(const T* X, const int64 length) {

Member:
a. method arg name
b. it could also be return Eigen::Map<...>(x, length).sum();

I didn't find any other simd pragmas in the TensorFlow codebase, so my assumption is that it's not supported by all compilers.

Contributor Author:
This function has been removed, so that's not a problem now.

.template cast<typename AccumulatorType<T>::type>()
.reshape(two_dims)
.sum(reduction_axis)
.template cast<T>();

Member:
I think all the performance problems with this grad were fixed in https://bitbucket.org/eigen/eigen/commits/948b43eaf2c5c35c988308d120a8a24781c63de6; which version of TF did you use in the benchmarks?

Contributor Author:
According to my measurements, I believe my implementation is still faster than the commit you mentioned, so could I keep using my implementation?

@ezhulenev (Member):
I did run the benchmarks from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc and already got pretty good results. Feel free to add more benchmarks; you probably want to test NCHW as well.

name                                              old time/op             new time/op             delta
BM_BiasAddGradNHWC_32_32_32_128_cpu               4.84ms ± 1%             2.30ms ± 1%  -52.46%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_256_cpu               9.66ms ± 2%             4.15ms ± 1%  -57.08%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_512_cpu               80.1ms ±13%             18.4ms ± 4%  -76.97%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_32_32_1024_cpu               242ms ± 2%               41ms ±17%  -83.17%          (p=0.000 n=9+9)
BM_BiasAddGradNHWC_32_64_64_128_cpu               58.3ms ± 3%             18.5ms ± 5%  -68.16%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_256_cpu                177ms ± 5%               44ms ±12%  -75.48%         (p=0.000 n=9+10)
BM_BiasAddGradNHWC_32_64_64_512_cpu                452ms ± 6%               92ms ± 8%  -79.75%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_1024_cpu               1.19s ± 5%              0.23s ± 7%  -80.93%        (p=0.000 n=10+10)

name                                              old allocs/op           new allocs/op           delta
BM_BiasAddGradNHWC_32_32_32_128_cpu                  106 ± 0%                 90 ± 0%  -15.09%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_32_32_256_cpu                  170 ± 0%                 90 ± 0%  -47.06%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_32_32_512_cpu                  170 ± 0%                 90 ± 1%  -47.35%         (p=0.000 n=8+10)
BM_BiasAddGradNHWC_32_32_32_1024_cpu                 170 ± 0%                 90 ± 1%  -47.33%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_128_cpu                  107 ± 1%                 90 ± 0%  -15.57%         (p=0.000 n=10+9)
BM_BiasAddGradNHWC_32_64_64_256_cpu                  171 ± 0%                 90 ± 0%  -47.37%          (p=0.001 n=8+9)
BM_BiasAddGradNHWC_32_64_64_512_cpu                  170 ± 0%                 90 ± 1%  -47.33%        (p=0.000 n=10+10)
BM_BiasAddGradNHWC_32_64_64_1024_cpu                 170 ± 0%                 90 ± 1%  -47.48%        (p=0.000 n=10+10)

  1. Group all tensorflow includes together.
  2. Delete unnecessary comments.
  3. Use Eigen ops instead.

Change-Id: I67edcd71cb4feeaf8ab0c1820d2c011e3409344d
@ezhulenev (Member):
Based on your PR I've submitted a very similar change in d5f6595; apparently this "reduce all outer dimensions" pattern is pretty common in other gradient kernels (e.g. in FusedBatchNorm, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/fused_batch_norm_op.cc#L215). Can you please move the code for the NCHW case to redux_functors.h and call it in a similar way? I'm not sure, though, what a good name for it would be.

Currently it takes a Tensor as input, but it could be any Eigen expression; I'll work on that after this PR is merged.
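
As a point of reference, here is a minimal plain-Eigen sketch of the "reduce all outer dimensions" pattern being factored out: view the input as (outer, inner) and sum over the outer axis. The names below are illustrative and this is not the actual redux_functor.h interface.

    // Hedged sketch: sum an (outer, inner) view of a flat buffer over its outer axis.
    #include <unsupported/Eigen/CXX11/Tensor>

    template <typename T>
    void ReduceOuterDims(const T* in, T* out, Eigen::Index outer, Eigen::Index inner) {
      // View the flat input as a 2-D row-major tensor of shape (outer, inner).
      Eigen::TensorMap<Eigen::Tensor<const T, 2, Eigen::RowMajor>> input(in, outer, inner);
      Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor>> output(out, inner);
      // Reduce over axis 0, i.e. over all collapsed outer dimensions.
      Eigen::array<Eigen::Index, 1> reduce_axis;
      reduce_axis[0] = 0;
      output = input.sum(reduce_axis);
    }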

@JoursBleu (Contributor):
Hi @ezhulenev,
I have read the code in ReduceOuterDimensions. So you want me to add a functor for situations like
(32, 11, 256, 256) -> (11),
or maybe more generally
(D1, D2, ..., DN) -> (DM) (where M belongs to [1, N]),
and then reference the new functor to do the reduction?

@ezhulenev (Member) commented Mar 12, 2019 via email.

@JoursBleu (Contributor):
@ezhulenev OK.

@JoursBleu (Contributor):
@ezhulenev
Could you please provide the command to run the test in bias_op_test.cc?

@ezhulenev (Member):
bazel run .... -- --benchmarks=all, I think, or something like that, but I'm not 100% sure; there might be some difference between internal and external tooling. Teng Lu (@Zantares) from Intel made it work in #26424 (comment), so I guess he definitely has the answer :)

   reduce in tensorflow/core/kernels/redux_functor.h
2. Opt ReduceOuterDimensions for large inner dim.
3. Rewrite NCHW BiasAddGradOp with ReduceOuterDimensions.

@ezhulenev (Member):
This fails internally with an error:

InvalidArgumentError: var and grad do not have the same shape[1] [3,1]
	 [[node adagrad/slot_11/update_matmul3_b/ApplyAdagrad (defined at {INTERNAL}/training_graph_generator.py:1749) ]]

Errors may have originated from an input operation.
Input Source operations connected to node adagrad/slot_11/update_matmul3_b/ApplyAdagrad:
 gradients/matmul3/BiasAdd_grad/BiasAddGrad (defined at {INTERNAL}//training_graph_generator.py:1306)	
 matmul3_b (defined at {INTERNAL})

I didn't have time to debug it yet.

@JoursBleu (Contributor):
@ezhulenev
OK, how can I reproduce the error? I cannot find the file "training_graph_generator.py" anywhere...

@JoursBleu (Contributor):
@ezhulenev Is there anything I can do?

@ezhulenev (Member):
I'll try to prepare a reproducible test; right now it's part of a large model and it's hard to tell what's going on.

@JoursBleu (Contributor):
@ezhulenev Thanks. Just tell me if there is anything I can do. We need this PR merged ASAP, because several models depend on it.

for (int i = num_dims - num_reduce_dims; i < num_dims; ++i)
inner_dim *= input_dims[i];

if (1 == inner_dim) {

Member:
Isn't this a bug? '32x32x1' should be reduced to a scalar?

Contributor:
Exactly! It should be "outer_dim" here.
Thank you very much for pointing out the bug!
Fixed.

Member:
tensorflow/python/keras:convolutional_test

ValueError: Shapes (3,) and (1, 1, 1, 1, 3) are incompatible

Contributor:
Fixed.

@ezhulenev previously approved these changes Mar 22, 2019
@tensorflow-copybara tensorflow-copybara merged commit 12a5578 into tensorflow:master Mar 25, 2019
tensorflow-copybara pushed a commit that referenced this pull request Mar 25, 2019
@nammbash nammbash deleted the letian/biasaddgrad_opt branch June 16, 2020 19:39